Exploratory Data Analysis (EDA)
Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests?
The aim is to engineer meaningful features from this data set and, by comparing the accuracy scores of several ML models, to select the model that best predicts cancellations.
This dataset contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.
From the accompanying ScienceDirect publication we know that:
Both hotels are located in Portugal (southern Europe): "H1 at the resort region of Algarve and H2 at the city of Lisbon". The distance between these two locations is ca. 280 km by car, and both border on the North Atlantic.
The 'adr' column stands for Average Daily Rate and is calculated by dividing the sum of all lodging transactions by the total number of staying nights.
The data contains "bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017".
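The 'adr' definition above can be sketched in a couple of lines; the revenue figures below are hypothetical, purely to illustrate the arithmetic:

```python
import pandas as pd

# Toy bookings (hypothetical numbers, only to illustrate the 'adr' definition)
toy = pd.DataFrame({
    'lodging_revenue': [300.0, 450.0],        # sum of all lodging transactions
    'stays_in_weekend_nights': [1, 0],
    'stays_in_week_nights': [2, 3],
})

# adr = total lodging revenue / total number of staying nights
total_nights = toy['stays_in_weekend_nights'] + toy['stays_in_week_nights']
toy['adr'] = toy['lodging_revenue'] / total_nights  # 300/3 = 100.0, 450/3 = 150.0
```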
# for ignoring all Jupyter warnings
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
pd.set_option('display.max_columns', 100)
import numpy as np
# for creating graphs
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mat
%matplotlib inline
# for sorting by month name
import sort_dataframeby_monthorweek as sd
######################## for modeling ########################
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
df = pd.read_csv('hotel_bookings.csv')
print(df.shape)
df.sample(10)
(119390, 32)
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16362 | Resort Hotel | 0 | 0 | 2015 | August | 35 | 24 | 1 | 1 | 2 | 0.0 | 0 | BB | FRA | Direct | Direct | 0 | 0 | 0 | A | A | 0 | No Deposit | NaN | NaN | 0 | Transient | 106.00 | 0 | 0 | Check-Out | 2015-08-26 |
| 69778 | City Hotel | 1 | 37 | 2017 | June | 23 | 6 | 0 | 1 | 2 | 0.0 | 0 | SC | PRT | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 9.0 | NaN | 0 | Transient | 120.00 | 0 | 2 | Canceled | 2017-05-04 |
| 20836 | Resort Hotel | 0 | 23 | 2016 | February | 6 | 6 | 2 | 3 | 2 | 0.0 | 0 | Undefined | PRT | Direct | Direct | 0 | 0 | 0 | E | E | 0 | No Deposit | NaN | NaN | 0 | Transient | 98.80 | 1 | 1 | Check-Out | 2016-02-11 |
| 59721 | City Hotel | 1 | 166 | 2016 | November | 45 | 1 | 0 | 3 | 2 | 0.0 | 0 | BB | PRT | Offline TA/TO | TA/TO | 0 | 0 | 0 | E | E | 0 | Non Refund | 236.0 | NaN | 0 | Transient | 130.00 | 0 | 0 | Canceled | 2016-07-13 |
| 118743 | City Hotel | 0 | 108 | 2017 | August | 34 | 20 | 2 | 3 | 1 | 0.0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | A | A | 0 | No Deposit | 14.0 | NaN | 0 | Transient | 112.50 | 0 | 0 | Check-Out | 2017-08-25 |
| 71669 | City Hotel | 1 | 187 | 2017 | July | 28 | 9 | 2 | 3 | 2 | 0.0 | 0 | BB | ROU | Online TA | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 9.0 | NaN | 0 | Transient | 120.60 | 0 | 0 | Canceled | 2017-01-26 |
| 115158 | City Hotel | 0 | 122 | 2017 | June | 26 | 28 | 0 | 4 | 2 | 0.0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | D | D | 1 | No Deposit | 9.0 | NaN | 0 | Transient | 131.03 | 0 | 0 | Check-Out | 2017-07-02 |
| 30344 | Resort Hotel | 0 | 0 | 2016 | November | 47 | 18 | 0 | 2 | 2 | 0.0 | 0 | HB | PRT | Online TA | TA/TO | 0 | 0 | 0 | A | D | 0 | No Deposit | 240.0 | NaN | 0 | Transient | 95.00 | 0 | 0 | Check-Out | 2016-11-20 |
| 11992 | Resort Hotel | 1 | 304 | 2017 | June | 23 | 8 | 2 | 3 | 2 | 0.0 | 0 | HB | PRT | Groups | TA/TO | 0 | 0 | 0 | A | A | 0 | Non Refund | 298.0 | NaN | 0 | Transient | 94.00 | 0 | 0 | Canceled | 2017-04-05 |
| 79395 | City Hotel | 0 | 11 | 2015 | October | 43 | 23 | 2 | 2 | 2 | 0.0 | 0 | BB | ITA | Offline TA/TO | TA/TO | 0 | 0 | 0 | A | A | 1 | No Deposit | 26.0 | NaN | 0 | Transient-Party | 88.50 | 0 | 0 | Check-Out | 2015-10-27 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype
---  ------                          --------------   -----
 0   hotel                           119390 non-null  object
 1   is_canceled                     119390 non-null  int64
 2   lead_time                       119390 non-null  int64
 3   arrival_date_year               119390 non-null  int64
 4   arrival_date_month              119390 non-null  object
 5   arrival_date_week_number        119390 non-null  int64
 6   arrival_date_day_of_month       119390 non-null  int64
 7   stays_in_weekend_nights         119390 non-null  int64
 8   stays_in_week_nights            119390 non-null  int64
 9   adults                          119390 non-null  int64
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64
 12  meal                            119390 non-null  object
 13  country                         118902 non-null  object
 14  market_segment                  119390 non-null  object
 15  distribution_channel            119390 non-null  object
 16  is_repeated_guest               119390 non-null  int64
 17  previous_cancellations          119390 non-null  int64
 18  previous_bookings_not_canceled  119390 non-null  int64
 19  reserved_room_type              119390 non-null  object
 20  assigned_room_type              119390 non-null  object
 21  booking_changes                 119390 non-null  int64
 22  deposit_type                    119390 non-null  object
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64
 26  customer_type                   119390 non-null  object
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64
 29  total_of_special_requests       119390 non-null  int64
 30  reservation_status              119390 non-null  object
 31  reservation_status_date         119390 non-null  object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB
df.describe()
| is_canceled | lead_time | arrival_date_year | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | booking_changes | agent | company | days_in_waiting_list | adr | required_car_parking_spaces | total_of_special_requests | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119386.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 103050.000000 | 6797.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 |
| mean | 0.370416 | 104.011416 | 2016.156554 | 27.165173 | 15.798241 | 0.927599 | 2.500302 | 1.856403 | 0.103890 | 0.007949 | 0.031912 | 0.087118 | 0.137097 | 0.221124 | 86.693382 | 189.266735 | 2.321149 | 101.831122 | 0.062518 | 0.571363 |
| std | 0.482918 | 106.863097 | 0.707476 | 13.605138 | 8.780829 | 0.998613 | 1.908286 | 0.579261 | 0.398561 | 0.097436 | 0.175767 | 0.844336 | 1.497437 | 0.652306 | 110.774548 | 131.655015 | 17.594721 | 50.535790 | 0.245291 | 0.792798 |
| min | 0.000000 | 0.000000 | 2015.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 6.000000 | 0.000000 | -6.380000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 18.000000 | 2016.000000 | 16.000000 | 8.000000 | 0.000000 | 1.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 9.000000 | 62.000000 | 0.000000 | 69.290000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 69.000000 | 2016.000000 | 28.000000 | 16.000000 | 1.000000 | 2.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 14.000000 | 179.000000 | 0.000000 | 94.575000 | 0.000000 | 0.000000 |
| 75% | 1.000000 | 160.000000 | 2017.000000 | 38.000000 | 23.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 229.000000 | 270.000000 | 0.000000 | 126.000000 | 0.000000 | 1.000000 |
| max | 1.000000 | 737.000000 | 2017.000000 | 53.000000 | 31.000000 | 19.000000 | 50.000000 | 55.000000 | 10.000000 | 10.000000 | 1.000000 | 26.000000 | 72.000000 | 21.000000 | 535.000000 | 543.000000 | 391.000000 | 5400.000000 | 8.000000 | 5.000000 |
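The describe() output above already hints at data-quality issues in adr: a minimum of -6.38 and a maximum of 5400, versus 126 at the 75th percentile. A minimal range check, sketched on a toy frame (the thresholds here are arbitrary illustrative choices):

```python
import pandas as pd

# Toy adr values echoing the extremes seen in describe()
toy = pd.DataFrame({'adr': [-6.38, 94.57, 126.0, 5400.0]})

# Flag negative rates and implausibly large ones for manual inspection
suspect = toy[(toy['adr'] < 0) | (toy['adr'] > 1000)]
print(len(suspect))  # number of rows worth a closer look
```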
# nulls_percent =pd.DataFrame({'Feature':df_1.columns,
# 'number of nulls': df_1.isna().sum(),
# 'Precentage of nulls': (df_1.isna().sum()/df_1.shape[0])*100})
# nulls_percent = nulls_percent.reset_index(drop=True)
# nulls_percent
for col in df.columns:
    s = df[col].isna().sum()
    per = (s / df.shape[0]) * 100
    if s > 0:
        print("column: {:30s} Nulls: {:6d} {:15s} Percentage: {:2.2f}%".format(col, s, '', per))
column: children                       Nulls:      4                 Percentage: 0.00%
column: country                        Nulls:    488                 Percentage: 0.41%
column: agent                          Nulls:  16340                 Percentage: 13.69%
column: company                        Nulls: 112593                 Percentage: 94.31%
94.31% of the company column is missing, so we do not have enough values to fill its rows by prediction, mean imputation, etc. The best option seems to be dropping the company column.
df_1 = df.drop(['company'],axis=1)
col = 'agent'
length = len(df_1[col].unique())
print('There are {} unique categorical values in the {} column'.format(length, col))
There are 334 unique categorical values in the agent column
Only 13.69% of the agent column is missing, so there is no need to drop the agent column. We should not drop the affected rows either: 13.69% of the data is a substantial amount, and those rows may hold crucial information. There are 334 unique agents; with that many categories, the column may not be very predictive.
I will decide what to do about agent after the correlation section.
There are also 4 missing values in the children column. If there is no information about children, we assume those customers simply have none.
Fill the nulls with 0.
df_1['children'] = df_1['children'].fillna(0)
Only 0.41% of the country column is missing, so we can simply drop those rows.
indices = df_1.loc[pd.isna(df_1["country"]), :].index
df_1 = df_1.drop(df_1.index[indices])
print('There are {} nulls in the country column'.format(pd.isna(df_1["country"]).sum()))
There are 0 nulls in the country column
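As a side note, the index-based drop above can also be written with dropna(subset=...); a toy sketch (the column values here are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'country': ['PRT', np.nan, 'GBR'],
                    'adr': [100.0, 90.0, 80.0]})

# Drop only rows whose country is missing, leaving other columns' NaNs alone
toy = toy.dropna(subset=['country'])
```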
df_1['hotel'].unique()
array(['Resort Hotel', 'City Hotel'], dtype=object)
# Number of guests per column for the 2 hotels
def total_guests(df, hotels, by):
    total_guests = pd.DataFrame()
    for h in hotels:
        hotel_df = df[df['hotel'] == h]
        hotel_df = pd.DataFrame({by: by, h: hotel_df[by].value_counts(ascending=False)})
        total_guests = pd.concat([total_guests, hotel_df])  # DataFrame.append was removed in pandas 2.0
    total_guests[by] = total_guests.index
    total_guests = total_guests.groupby(by).agg('sum')
    total_guests.insert(0, by, total_guests.index)
    total_guests = total_guests.reset_index(drop=True)
    try:
        # sort by calendar month order when `by` is a month column
        total_guests = sd.Sort_Dataframeby_Month(total_guests, by)
    except Exception:
        pass
    return total_guests
by = 'arrival_date_month'
## Number of guests for each month for the 2 Hotels
guests_per_month = total_guests(df_1, df_1['hotel'].unique(),by)
guests_per_month
| arrival_date_month | Resort Hotel | City Hotel | |
|---|---|---|---|
| 0 | January | 2138.0 | 3736.0 |
| 1 | February | 3047.0 | 4965.0 |
| 2 | March | 3281.0 | 6458.0 |
| 3 | April | 3569.0 | 7476.0 |
| 4 | May | 3547.0 | 8232.0 |
| 5 | June | 3033.0 | 7894.0 |
| 6 | July | 4540.0 | 8088.0 |
| 7 | August | 4873.0 | 8983.0 |
| 8 | September | 3067.0 | 7400.0 |
| 9 | October | 3504.0 | 7591.0 |
| 10 | November | 2398.0 | 4354.0 |
| 11 | December | 2599.0 | 4129.0 |
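For reference, the same per-month, per-hotel counts can be produced with a pivot_table of group sizes; a toy sketch (note that, like the function above, this counts bookings rather than heads):

```python
import pandas as pd

toy = pd.DataFrame({
    'hotel': ['Resort Hotel', 'Resort Hotel', 'City Hotel', 'City Hotel', 'City Hotel'],
    'arrival_date_month': ['July', 'August', 'July', 'July', 'August'],
})

# One row per month, one column per hotel, cell = number of bookings
counts = toy.pivot_table(index='arrival_date_month', columns='hotel',
                         aggfunc='size', fill_value=0)
```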
px.line(
guests_per_month, x = by, y=['Resort Hotel','City Hotel'],
title = 'Number of guests for each month'.upper(),
labels={ 'value':"number of guests".title(),
by:'Month',
'variable':'Hotel'.title()}
)
mystyle = plt.style.library['fivethirtyeight']
# plt.style.use('default')
# fig, ax = plt.subplots(figsize=(15,6))
# guests_per_month.plot.bar(ax=ax,width = 0.8)
# ax.set_xticklabels(guests_per_month['Month'], fontsize=14, rotation = 30)
# ax.set_title('Number of guests for each month'.upper(), fontsize=24)
# ax.set_xlabel('month'.title(), fontsize=20)
# ax.set_ylabel('number of guests'.title(), fontsize=20)
# ax.legend(fontsize=15)
# plt.show()
df_2 = pd.DataFrame()
df_2['dur_per_nights']= df_1['stays_in_weekend_nights'] + df_1['stays_in_week_nights']
df_2 = pd.concat([df_1['hotel'],df_2], axis = 1)
df_2
| hotel | dur_per_nights | |
|---|---|---|
| 0 | Resort Hotel | 0 |
| 1 | Resort Hotel | 0 |
| 2 | Resort Hotel | 1 |
| 3 | Resort Hotel | 1 |
| 4 | Resort Hotel | 2 |
| ... | ... | ... |
| 119385 | City Hotel | 7 |
| 119386 | City Hotel | 7 |
| 119387 | City Hotel | 7 |
| 119388 | City Hotel | 7 |
| 119389 | City Hotel | 9 |
118902 rows × 2 columns
total_nights_per_hotel = df_2.groupby(['hotel','dur_per_nights']).agg({'dur_per_nights':'count'})
total_nights_per_hotel.columns=['n_trans']
total_nights_per_hotel.insert(0,'hotel', total_nights_per_hotel.index.get_level_values(0))
total_nights_per_hotel.insert(1,'dur_per_nights', total_nights_per_hotel.index.get_level_values(1))
total_nights_per_hotel = total_nights_per_hotel.reset_index(drop=True)
total_nights_per_hotel
| hotel | dur_per_nights | n_trans | |
|---|---|---|---|
| 0 | City Hotel | 0 | 324 |
| 1 | City Hotel | 1 | 13272 |
| 2 | City Hotel | 2 | 21426 |
| 3 | City Hotel | 3 | 21366 |
| 4 | City Hotel | 4 | 12556 |
| ... | ... | ... | ... |
| 69 | Resort Hotel | 38 | 1 |
| 70 | Resort Hotel | 42 | 4 |
| 71 | Resort Hotel | 45 | 1 |
| 72 | Resort Hotel | 46 | 1 |
| 73 | Resort Hotel | 56 | 2 |
74 rows × 3 columns
fig = px.histogram(
total_nights_per_hotel[total_nights_per_hotel['dur_per_nights']<=15], x = 'dur_per_nights',y='n_trans', color='hotel',
title = 'Number of transactions per number nights duration'.upper(),nbins=15,
labels={ 'n_trans':"transactions",
'dur_per_nights':'Duration by nights',
'hotel':'Hotel Name',}
)
fig.update_layout(
xaxis = dict(
tick0 = 1,
dtick = 1
)
)
Most guests do not stay longer than one week. At the Resort Hotel, however, stays of up to about 15 nights still appear common.
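The claim above can be quantified from total_nights_per_hotel by computing the share of bookings with at most 7 nights; a sketch on toy counts shaped like that table (the n_trans figures are illustrative):

```python
import pandas as pd

# Same columns as total_nights_per_hotel; toy counts only
toy = pd.DataFrame({
    'hotel': ['City Hotel'] * 3 + ['Resort Hotel'] * 3,
    'dur_per_nights': [2, 7, 14, 3, 7, 14],
    'n_trans': [21426, 1500, 100, 3000, 900, 400],
})

total = toy.groupby('hotel')['n_trans'].sum()
short = toy[toy['dur_per_nights'] <= 7].groupby('hotel')['n_trans'].sum()
share_short = (short / total).round(3)  # fraction of bookings lasting <= 7 nights
```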
# # Number of guests for each month in the Resort Hotel
# def hotel_guests_per_month(df, hotel, by):
# hotel_df = df[df['hotel']== hotel]
# hotel_df = pd.DataFrame({by:by,hotel: hotel_df[by].value_counts(ascending=False)})
# hotel_df[by] = hotel_df.index
# hotel_df = hotel_df.reset_index(drop=True)
# return hotel_df
# number of cancellations per hotel for a specific column
def cancelation_per_hotel(df, hotel, by):
hotel_df= df[df['hotel'] == hotel]
hotel_df = hotel_df.groupby([by,'is_canceled']).agg({'is_canceled':'count'})
hotel_df.rename({'is_canceled':'value'}, axis=1,inplace = True)
hotel_df.insert(0,by,hotel_df.index.get_level_values(0))
hotel_df.insert(1,'cancelation',hotel_df.index.get_level_values(1))
hotel_df['cancelation'] = hotel_df['cancelation'].apply(lambda x:'canceled' if x == 1 else 'Not canceled')
hotel_df = hotel_df.reset_index(drop = True)
try:
hotel_df = sd.Sort_Dataframeby_Month(hotel_df,by)
except:
pass
return hotel_df
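As an aside, the groupby counting inside cancelation_per_hotel is essentially a contingency table, which pd.crosstab builds directly; a toy sketch:

```python
import pandas as pd

toy = pd.DataFrame({
    'arrival_date_month': ['January', 'January', 'January', 'February'],
    'is_canceled': [0, 0, 1, 1],
})

# Rows = months, columns = is_canceled (0/1), cells = booking counts
table = pd.crosstab(toy['arrival_date_month'], toy['is_canceled'])
```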
by = 'arrival_date_month'
hotel = 'City Hotel'
City_df1 = cancelation_per_hotel(df_1, hotel,by)
City_df1
| arrival_date_month | cancelation | value | |
|---|---|---|---|
| 0 | January | Not canceled | 2254 |
| 1 | January | canceled | 1482 |
| 2 | February | Not canceled | 3064 |
| 3 | February | canceled | 1901 |
| 4 | March | Not canceled | 4072 |
| 5 | March | canceled | 2386 |
| 6 | April | Not canceled | 4015 |
| 7 | April | canceled | 3461 |
| 8 | May | Not canceled | 4579 |
| 9 | May | canceled | 3653 |
| 10 | June | Not canceled | 4366 |
| 11 | June | canceled | 3528 |
| 12 | July | Not canceled | 4782 |
| 13 | July | canceled | 3306 |
| 14 | August | Not canceled | 5381 |
| 15 | August | canceled | 3602 |
| 16 | September | Not canceled | 4290 |
| 17 | September | canceled | 3110 |
| 18 | October | Not canceled | 4337 |
| 19 | October | canceled | 3254 |
| 20 | November | Not canceled | 2694 |
| 21 | November | canceled | 1660 |
| 22 | December | Not canceled | 2392 |
| 23 | December | canceled | 1737 |
px.bar(City_df1, x=by, y='value', color='cancelation', barmode='group',
       labels={'cancelation': "Cancellation", by: 'Month'},
       title = 'Number of cancellations per month for {}'.format(hotel).upper())
by = 'arrival_date_month'
hotel = 'Resort Hotel'
Resort_df1 = cancelation_per_hotel(df_1, hotel,by)
Resort_df1
| arrival_date_month | cancelation | value | |
|---|---|---|---|
| 0 | January | Not canceled | 1814 |
| 1 | January | canceled | 324 |
| 2 | February | Not canceled | 2253 |
| 3 | February | canceled | 794 |
| 4 | March | Not canceled | 2519 |
| 5 | March | canceled | 762 |
| 6 | April | Not canceled | 2518 |
| 7 | April | canceled | 1051 |
| 8 | May | Not canceled | 2523 |
| 9 | May | canceled | 1024 |
| 10 | June | Not canceled | 2027 |
| 11 | June | canceled | 1006 |
| 12 | July | Not canceled | 3110 |
| 13 | July | canceled | 1430 |
| 14 | August | Not canceled | 3237 |
| 15 | August | canceled | 1636 |
| 16 | September | Not canceled | 2077 |
| 17 | September | canceled | 990 |
| 18 | October | Not canceled | 2530 |
| 19 | October | canceled | 974 |
| 20 | November | Not canceled | 1938 |
| 21 | November | canceled | 460 |
| 22 | December | Not canceled | 1973 |
| 23 | December | canceled | 626 |
px.bar(Resort_df1, x=by, y='value', color='cancelation', barmode='group',
       labels={'cancelation': "Cancellation", by: 'Month'},
       title = 'Number of cancellations per month for {}'.format(hotel).upper())
by = 'customer_type'
hotel = 'City Hotel'
City_df2 = cancelation_per_hotel(df_1, hotel,by)
City_df2
| customer_type | cancelation | value | |
|---|---|---|---|
| 0 | Contract | Not canceled | 1195 |
| 1 | Contract | canceled | 1105 |
| 2 | Group | Not canceled | 263 |
| 3 | Group | canceled | 29 |
| 4 | Transient | Not canceled | 32306 |
| 5 | Transient | canceled | 27076 |
| 6 | Transient-Party | Not canceled | 12462 |
| 7 | Transient-Party | canceled | 4870 |
px.bar(City_df2, x=by, y='value', color='cancelation', barmode='group',
       labels={'cancelation': "Cancellation", by: 'Customer Type'},
       title = 'Number of cancellations per customer type for {}'.format(hotel).upper())
by = 'customer_type'
hotel = 'Resort Hotel'
Resort_df2 = cancelation_per_hotel(df_1, hotel,by)
Resort_df2
| customer_type | cancelation | value | |
|---|---|---|---|
| 0 | Contract | Not canceled | 1619 |
| 1 | Contract | canceled | 157 |
| 2 | Group | Not canceled | 249 |
| 3 | Group | canceled | 29 |
| 4 | Transient | Not canceled | 20408 |
| 5 | Transient | canceled | 9384 |
| 6 | Transient-Party | Not canceled | 6243 |
| 7 | Transient-Party | canceled | 1507 |
px.bar(Resort_df2, x=by, y='value', color='cancelation', barmode='group',
       labels={'cancelation': "Cancellation", by: 'Customer Type'},
       title = 'Number of cancellations per customer type for {}'.format(hotel).upper())
by = 'days_in_waiting_list'
df_3 = df_1[(df_1['is_canceled']==1)]
## Number of cancellations per days_in_waiting_list for the 2 Hotels
wating_df = total_guests(df_3, df_3['hotel'].unique(),by)
wating_df
| days_in_waiting_list | Resort Hotel | City Hotel | |
|---|---|---|---|
| 0 | 0 | 11060.0 | 30738.0 |
| 1 | 1 | 1.0 | 2.0 |
| 2 | 2 | 0.0 | 1.0 |
| 3 | 3 | 0.0 | 59.0 |
| 4 | 4 | 0.0 | 8.0 |
| ... | ... | ... | ... |
| 100 | 224 | 0.0 | 6.0 |
| 101 | 236 | 0.0 | 6.0 |
| 102 | 330 | 0.0 | 1.0 |
| 103 | 379 | 0.0 | 9.0 |
| 104 | 391 | 0.0 | 45.0 |
105 rows × 3 columns
hotels = wating_df.columns[1:]
hotels
Index(['Resort Hotel', 'City Hotel'], dtype='object')
px.scatter(wating_df.loc[1:,:], x=by, y=hotels,
           labels={'variable': "Hotel Name", by: 'Days in waiting list',
                   'value': 'Number of cancellations'},
           title = 'Number of cancellations per waiting days for both hotels'.upper()
          )
For the City Hotel, the number of cancellations decreases as the number of days on the waiting list increases.
For the Resort Hotel, cancellations among waitlisted bookings are very rare.
zero_day_wating = wating_df.iloc[0,1:]
zero_day_wating = pd.DataFrame(zero_day_wating)
zero_day_wating
| 0 | |
|---|---|
| Resort Hotel | 11060.0 |
| City Hotel | 30738.0 |
rem_day_waiting = pd.DataFrame(wating_df.iloc[1:,1:])
rem_day_waiting = rem_day_waiting.sum()
rem_day_waiting = pd.DataFrame(rem_day_waiting)
rem_day_waiting
| 0 | |
|---|---|
| Resort Hotel | 17.0 |
| City Hotel | 2342.0 |
n_day_waiting = pd.concat([zero_day_wating,rem_day_waiting],axis = 1).T
n_day_waiting = n_day_waiting.reset_index(drop=True)
n_day_waiting.insert(0, 'Waiting days', ['0', 'N'])
n_day_waiting
| Waiting days | Resort Hotel | City Hotel | |
|---|---|---|---|
| 0 | 0 | 11060.0 | 30738.0 |
| 1 | N | 17.0 | 2342.0 |
px.bar(n_day_waiting, x='Waiting days', y=hotels, barmode='group',
       labels={'variable': "Hotel Name", 'value': 'Number of cancellations'},
       title = 'Number of cancellations per waiting days for both hotels'.upper()
      )
For both hotels, the number of cancellations is overwhelmingly concentrated at zero waiting days.
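Because almost all bookings never enter the waiting list, raw counts at zero waiting days are dominated by sheer volume; comparing cancellation rates instead may be a fairer check. A toy sketch:

```python
import pandas as pd

toy = pd.DataFrame({
    'days_in_waiting_list': [0, 0, 0, 0, 5, 5],
    'is_canceled': [1, 0, 0, 0, 1, 0],
})

# Cancellation rate for bookings that waited (True) vs did not wait (False)
rate_by_waited = toy.groupby(toy['days_in_waiting_list'].gt(0))['is_canceled'].mean()
```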
Adding the following features to the dataset (is_family, total_customer, deposit_given, total_nights)
df_1.deposit_type.unique()
array(['No Deposit', 'Refundable', 'Non Refund'], dtype=object)
def is_family(df):
if ((df['adults'] > 0) & (df['children'] > 0)):
return 1
elif ((df['adults'] > 0) & (df['babies'] > 0)):
return 1
else:
return 0
def is_deposit(df):
if (df['deposit_type'] == 'No Deposit') | (df['deposit_type'] == 'Refundable'):
return 0
else:
return 1
df_5=df_1.copy()
df_5['is_family'] = df_5.apply(is_family, axis = 1)
df_5["total_customer"] = df_5["adults"] + df_5["children"] + df_5["babies"]
df_5["deposit_given"] = df_5.apply(is_deposit, axis=1)
df_5["total_nights"] = df_5["stays_in_weekend_nights"]+ df_5["stays_in_week_nights"]
df_5.head()
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | is_family | total_customer | deposit_given | total_nights | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 | 0 | 2.0 | 0 | 0 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 | 0 | 2.0 | 0 | 0 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Direct | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 | 0 | 1.0 | 0 | 1 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Corporate | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 304.0 | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 | 0 | 1.0 | 0 | 1 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240.0 | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 | 0 | 2.0 | 0 | 2 |
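The row-wise apply calls above work, but the same features can be derived vectorized; a sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    'adults': [2, 1, 0],
    'children': [1.0, 0.0, 0.0],
    'babies': [0, 1, 0],
    'deposit_type': ['No Deposit', 'Non Refund', 'Refundable'],
})

# Family = at least one adult travelling with children and/or babies
toy['is_family'] = ((toy['adults'] > 0)
                    & ((toy['children'] > 0) | (toy['babies'] > 0))).astype(int)
# Only 'Non Refund' counts as a deposit actually given
toy['deposit_given'] = (toy['deposit_type'] == 'Non Refund').astype(int)
```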
I created new features that are more expressive than the originals, so I'll drop the following columns:
['adults', 'babies', 'children', 'deposit_type', 'reservation_status_date']
df_5.shape
(118902, 35)
df_6 = df_5.drop(columns = ['adults', 'babies', 'children', 'deposit_type', 'reservation_status_date'])
df_6.shape
(118902, 30)
df_6.dtypes
hotel                              object
is_canceled                         int64
lead_time                           int64
arrival_date_year                   int64
arrival_date_month                 object
arrival_date_week_number            int64
arrival_date_day_of_month           int64
stays_in_weekend_nights             int64
stays_in_week_nights                int64
meal                               object
country                            object
market_segment                     object
distribution_channel               object
is_repeated_guest                   int64
previous_cancellations              int64
previous_bookings_not_canceled      int64
reserved_room_type                 object
assigned_room_type                 object
booking_changes                     int64
agent                             float64
days_in_waiting_list                int64
customer_type                      object
adr                               float64
required_car_parking_spaces         int64
total_of_special_requests           int64
reservation_status                 object
is_family                           int64
total_customer                    float64
deposit_given                       int64
total_nights                        int64
dtype: object
df_7 = df_6.copy()
df_7['hotel'] = df_7['hotel'].map({'Resort Hotel':0, 'City Hotel':1})
df_7['arrival_date_month'] = df_7['arrival_date_month'].map({'January':1, 'February': 2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7,
'August':8, 'September':9, 'October':10, 'November':11, 'December':12})
from sklearn.preprocessing import LabelEncoder

# label-encode the remaining categorical columns
le = LabelEncoder()
for col in ['meal', 'distribution_channel', 'reserved_room_type', 'assigned_room_type',
            'agent', 'customer_type', 'reservation_status', 'market_segment', 'country']:
    df_7[col] = le.fit_transform(df_7[col])
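An alternative to fitting a LabelEncoder per column is pd.factorize, which assigns codes in order of first appearance rather than sorted order; the resulting codes differ from LabelEncoder's, but either is serviceable for tree-based models. A toy sketch:

```python
import pandas as pd

toy = pd.DataFrame({'meal': ['BB', 'HB', 'BB', 'SC']})

# Codes follow order of appearance: BB -> 0, HB -> 1, SC -> 2
toy['meal'], uniques = pd.factorize(toy['meal'])
```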
df_7
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | agent | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | is_family | total_customer | deposit_given | total_nights | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 342 | 2015 | 7 | 27 | 1 | 0 | 0 | 0 | 135 | 3 | 1 | 0 | 0 | 0 | 2 | 2 | 3 | 332 | 0 | 2 | 0.00 | 0 | 0 | 1 | 0 | 2.0 | 0 | 0 |
| 1 | 0 | 0 | 737 | 2015 | 7 | 27 | 1 | 0 | 0 | 0 | 135 | 3 | 1 | 0 | 0 | 0 | 2 | 2 | 4 | 332 | 0 | 2 | 0.00 | 0 | 0 | 1 | 0 | 2.0 | 0 | 0 |
| 2 | 0 | 0 | 7 | 2015 | 7 | 27 | 1 | 0 | 1 | 0 | 59 | 3 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 332 | 0 | 2 | 75.00 | 0 | 0 | 1 | 0 | 1.0 | 0 | 1 |
| 3 | 0 | 0 | 13 | 2015 | 7 | 27 | 1 | 0 | 1 | 0 | 59 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 220 | 0 | 2 | 75.00 | 0 | 0 | 1 | 0 | 1.0 | 0 | 1 |
| 4 | 0 | 0 | 14 | 2015 | 7 | 27 | 1 | 0 | 2 | 0 | 59 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 173 | 0 | 2 | 98.00 | 0 | 1 | 1 | 0 | 2.0 | 0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 119385 | 1 | 0 | 23 | 2017 | 8 | 35 | 30 | 2 | 5 | 0 | 15 | 5 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 271 | 0 | 2 | 96.14 | 0 | 0 | 1 | 0 | 2.0 | 0 | 7 |
| 119386 | 1 | 0 | 102 | 2017 | 8 | 35 | 31 | 2 | 5 | 0 | 56 | 6 | 3 | 0 | 0 | 0 | 4 | 4 | 0 | 8 | 0 | 2 | 225.43 | 0 | 2 | 1 | 0 | 3.0 | 0 | 7 |
| 119387 | 1 | 0 | 34 | 2017 | 8 | 35 | 31 | 2 | 5 | 0 | 43 | 6 | 3 | 0 | 0 | 0 | 3 | 3 | 0 | 8 | 0 | 2 | 157.71 | 0 | 4 | 1 | 0 | 2.0 | 0 | 7 |
| 119388 | 1 | 0 | 109 | 2017 | 8 | 35 | 31 | 2 | 5 | 0 | 59 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 77 | 0 | 2 | 104.40 | 0 | 0 | 1 | 0 | 2.0 | 0 | 7 |
| 119389 | 1 | 0 | 205 | 2017 | 8 | 35 | 29 | 2 | 7 | 2 | 43 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 2 | 151.20 | 0 | 2 | 1 | 0 | 2.0 | 0 | 9 |
118902 rows × 30 columns
df_7.columns
Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
'arrival_date_month', 'arrival_date_week_number',
'arrival_date_day_of_month', 'stays_in_weekend_nights',
'stays_in_week_nights', 'meal', 'country', 'market_segment',
'distribution_channel', 'is_repeated_guest', 'previous_cancellations',
'previous_bookings_not_canceled', 'reserved_room_type',
'assigned_room_type', 'booking_changes', 'agent',
'days_in_waiting_list', 'customer_type', 'adr',
'required_car_parking_spaces', 'total_of_special_requests',
'reservation_status', 'is_family', 'total_customer', 'deposit_given',
'total_nights'],
dtype='object')
corr_df = df_7.corr()
corr_df
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | agent | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | is_family | total_customer | deposit_given | total_nights | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hotel | 1.000000 | 0.133990 | 0.071842 | 0.033724 | 0.001231 | 0.000739 | -0.002129 | -0.189729 | -0.237767 | 0.006087 | -0.041299 | 0.077583 | 0.167945 | -0.051353 | -0.012334 | 0.000745 | -0.252232 | -0.307583 | -0.073166 | -0.557676 | 0.072011 | 0.047190 | 0.093117 | -0.217408 | -0.043983 | -0.121841 | -0.059504 | -0.045235 | 0.170543 | -0.251797 |
| is_canceled | 0.133990 | 1.000000 | 0.291940 | 0.016339 | 0.010325 | 0.007481 | -0.006173 | -0.002639 | 0.024103 | -0.018679 | 0.270254 | 0.056972 | 0.165596 | -0.085185 | 0.109914 | -0.055495 | -0.062228 | -0.175882 | -0.144669 | -0.127336 | 0.054008 | -0.068698 | 0.046133 | -0.194801 | -0.235595 | -0.917228 | -0.013271 | 0.045046 | 0.481318 | 0.016963 |
| lead_time | 0.071842 | 0.291940 | 1.000000 | 0.039974 | 0.131236 | 0.126724 | 0.002354 | 0.083988 | 0.164783 | -0.000632 | 0.054295 | 0.009199 | 0.216950 | -0.125084 | 0.085962 | -0.071124 | -0.107035 | -0.171179 | 0.000014 | -0.179554 | 0.170008 | 0.072794 | -0.066332 | -0.115551 | -0.096560 | -0.301064 | -0.044753 | 0.069546 | 0.379746 | 0.155912 |
| arrival_date_year | 0.033724 | 0.016339 | 0.039974 | 1.000000 | -0.527603 | -0.540488 | -0.000531 | 0.021678 | 0.031759 | 0.065360 | -0.153309 | 0.106956 | 0.020824 | 0.010137 | -0.119911 | 0.029804 | 0.092994 | 0.036669 | 0.031141 | 0.037359 | -0.056813 | -0.006081 | 0.197919 | -0.012646 | 0.108873 | -0.017264 | 0.052665 | 0.051689 | -0.066119 | 0.032199 |
| arrival_date_month | 0.001231 | 0.010325 | 0.131236 | -0.527603 | 1.000000 | 0.995094 | -0.026152 | 0.017801 | 0.018585 | -0.015453 | 0.025273 | 0.000732 | 0.006861 | -0.031009 | 0.037337 | -0.021509 | -0.008502 | -0.005755 | 0.004465 | -0.043843 | 0.019088 | -0.030046 | 0.078780 | 0.000106 | 0.027651 | -0.020520 | 0.010292 | 0.026750 | 0.008325 | 0.020844 |
| arrival_date_week_number | 0.000739 | 0.007481 | 0.126724 | -0.540488 | 0.995094 | 1.000000 | 0.066824 | 0.017640 | 0.015006 | -0.017630 | 0.025995 | -0.001003 | 0.005277 | -0.030414 | 0.035366 | -0.020769 | -0.008544 | -0.005066 | 0.005183 | -0.042781 | 0.022992 | -0.028708 | 0.075256 | 0.001714 | 0.025788 | -0.016844 | 0.010484 | 0.024749 | 0.007368 | 0.018110 |
| arrival_date_day_of_month | -0.002129 | -0.006173 | 0.002354 | -0.000531 | -0.026152 | 0.066824 | 1.000000 | -0.015903 | -0.027589 | -0.007210 | -0.000647 | -0.004206 | 0.001584 | -0.006334 | -0.027009 | 0.000121 | 0.017142 | 0.012060 | 0.010779 | 0.004100 | 0.022741 | 0.012141 | 0.029980 | 0.008271 | 0.003050 | 0.011461 | 0.014693 | 0.006491 | -0.008706 | -0.026825 |
| stays_in_weekend_nights | -0.189729 | -0.002639 | 0.083988 | 0.021678 | 0.017801 | 0.017640 | -0.015903 | 1.000000 | 0.494890 | 0.044639 | -0.128056 | 0.114062 | 0.091153 | -0.087833 | -0.013007 | -0.040596 | 0.141698 | 0.087755 | 0.062402 | -0.014310 | -0.054566 | -0.110194 | 0.047319 | -0.018145 | 0.071654 | 0.009460 | 0.051830 | 0.100050 | -0.115016 | 0.760957 |
| stays_in_week_nights | -0.237767 | 0.024103 | 0.164783 | 0.031759 | 0.018585 | 0.015006 | -0.027589 | 0.494890 | 1.000000 | 0.035778 | -0.121238 | 0.107860 | 0.085830 | -0.097992 | -0.014273 | -0.047366 | 0.168659 | 0.102136 | 0.095664 | 0.017415 | -0.002160 | -0.128539 | 0.063647 | -0.024376 | 0.066778 | -0.020913 | 0.050410 | 0.100921 | -0.081017 | 0.940370 |
| meal | 0.006087 | -0.018679 | -0.000632 | 0.065360 | -0.015453 | -0.017630 | -0.007210 | 0.044639 | 0.035778 | 1.000000 | -0.088321 | 0.144589 | 0.116045 | -0.057239 | -0.003738 | -0.039031 | -0.122848 | -0.122136 | 0.024363 | -0.112870 | -0.007309 | 0.044441 | 0.057828 | -0.038135 | 0.022570 | 0.016378 | -0.042302 | -0.007011 | -0.091181 | 0.044187 |
| country | -0.041299 | 0.270254 | 0.054295 | -0.153309 | 0.025273 | 0.025995 | -0.000647 | -0.128056 | -0.121238 | -0.088321 | 1.000000 | -0.267296 | -0.125945 | 0.130210 | 0.077549 | 0.074624 | -0.100606 | -0.079197 | -0.040723 | 0.174500 | 0.060939 | 0.027014 | -0.114490 | 0.000958 | -0.165658 | -0.246277 | -0.039871 | -0.107640 | 0.334281 | -0.140649 |
| market_segment | 0.077583 | 0.056972 | 0.009199 | 0.106956 | 0.000732 | -0.001003 | -0.004206 | 0.114062 | 0.107860 | 0.144589 | -0.267296 | 1.000000 | 0.764934 | -0.252762 | -0.060032 | -0.175344 | 0.095087 | 0.030533 | -0.072222 | -0.507993 | -0.042469 | -0.167728 | 0.229628 | -0.059595 | 0.275221 | -0.059246 | 0.079899 | 0.209508 | -0.187179 | 0.125182 |
| distribution_channel | 0.167945 | 0.165596 | 0.216950 | 0.020824 | 0.006861 | 0.005277 | 0.001584 | 0.091153 | 0.085830 | 0.116045 | -0.125945 | 0.764934 | 1.000000 | -0.266411 | -0.022747 | -0.199790 | -0.042898 | -0.101512 | -0.114587 | -0.579849 | 0.048226 | -0.071206 | 0.087347 | -0.129843 | 0.098380 | -0.169370 | -0.000858 | 0.139269 | 0.100972 | 0.099766 |
| is_repeated_guest | -0.051353 | -0.085185 | -0.125084 | 0.010137 | -0.031009 | -0.030414 | -0.006334 | -0.087833 | -0.097992 | -0.057239 | 0.130210 | -0.252762 | -0.266411 | 1.000000 | 0.082376 | 0.423259 | -0.029825 | 0.032869 | 0.012166 | 0.215063 | -0.022322 | -0.017254 | -0.135374 | 0.077778 | 0.013145 | 0.083882 | -0.035257 | -0.137663 | -0.058639 | -0.107548 |
| previous_cancellations | -0.012334 | 0.109914 | 0.085962 | -0.119911 | 0.037337 | 0.035366 | -0.027009 | -0.013007 | -0.014273 | -0.003738 | 0.077549 | -0.060032 | -0.022747 | 0.082376 | 1.000000 | 0.154285 | -0.048902 | -0.058483 | -0.027091 | 0.017706 | 0.005927 | -0.008373 | -0.065922 | -0.018454 | -0.048587 | -0.110572 | -0.027292 | -0.020288 | 0.143478 | -0.015749 |
| previous_bookings_not_canceled | 0.000745 | -0.055495 | -0.071124 | 0.029804 | -0.021509 | -0.020769 | 0.000121 | -0.040596 | -0.047366 | -0.039031 | 0.074624 | -0.175344 | -0.199790 | 0.423259 | 0.154285 | 1.000000 | -0.020777 | 0.000774 | 0.011971 | 0.146949 | -0.009011 | -0.011543 | -0.069631 | 0.046946 | 0.037592 | 0.053281 | -0.022042 | -0.096288 | -0.030459 | -0.051257 |
| reserved_room_type | -0.252232 | -0.062228 | -0.107035 | 0.092994 | -0.008502 | -0.008544 | 0.017142 | 0.141698 | 0.168659 | -0.122848 | -0.100606 | 0.095087 | -0.042898 | -0.029825 | -0.048902 | -0.020777 | 1.000000 | 0.815509 | 0.044803 | 0.094304 | -0.069047 | -0.121279 | 0.392995 | 0.131736 | 0.136842 | 0.059591 | 0.324479 | 0.384603 | -0.201930 | 0.181397 |
| assigned_room_type | -0.307583 | -0.175882 | -0.171179 | 0.036669 | -0.005755 | -0.005066 | 0.012060 | 0.087755 | 0.102136 | -0.122136 | -0.079197 | 0.030533 | -0.101512 | 0.032869 | -0.058483 | 0.000774 | 0.815509 | 1.000000 | 0.096301 | 0.162714 | -0.068642 | -0.084245 | 0.261753 | 0.158899 | 0.124819 | 0.172399 | 0.294314 | 0.306438 | -0.246616 | 0.110611 |
| booking_changes | -0.073166 | -0.144669 | 0.000014 | 0.031141 | 0.004465 | 0.005183 | 0.010779 | 0.062402 | 0.095664 | 0.024363 | -0.040723 | -0.072222 | -0.114587 | 0.012166 | -0.027091 | 0.011971 | 0.044803 | 0.096301 | 1.000000 | 0.096194 | -0.011660 | 0.092161 | 0.019217 | 0.065727 | 0.052423 | 0.141062 | 0.078696 | -0.003885 | -0.119482 | 0.095855 |
| agent | -0.557676 | -0.127336 | -0.179554 | 0.037359 | -0.043843 | -0.042781 | 0.004100 | -0.014310 | 0.017415 | -0.112870 | 0.174500 | -0.507993 | -0.579849 | 0.215063 | 0.017706 | 0.146949 | 0.094304 | 0.162714 | 0.096194 | 1.000000 | -0.065868 | 0.100596 | -0.121253 | 0.166542 | -0.068537 | 0.124064 | -0.004207 | -0.131935 | -0.054546 | 0.007402 |
| days_in_waiting_list | 0.072011 | 0.054008 | 0.170008 | -0.056813 | 0.019088 | 0.022992 | 0.022741 | -0.054566 | -0.002160 | -0.007309 | 0.060939 | -0.042469 | 0.048226 | -0.022322 | 0.005927 | -0.009011 | -0.069047 | -0.068642 | -0.011660 | -0.065868 | 1.000000 | 0.099124 | -0.041325 | -0.030461 | -0.082972 | -0.057764 | -0.036442 | -0.026929 | 0.120183 | -0.022973 |
| customer_type | 0.047190 | -0.068698 | 0.072794 | -0.006081 | -0.030046 | -0.028708 | 0.012141 | -0.110194 | -0.128539 | 0.044441 | 0.027014 | -0.167728 | -0.071206 | -0.017254 | -0.008373 | -0.011543 | -0.121279 | -0.084245 | 0.092161 | 0.100596 | 0.099124 | 1.000000 | -0.077974 | -0.030066 | -0.135907 | 0.066554 | -0.060312 | -0.114158 | -0.086948 | -0.139109 |
| adr | 0.093117 | 0.046133 | -0.066332 | 0.197919 | 0.078780 | 0.075256 | 0.029980 | 0.047319 | 0.063647 | 0.057828 | -0.114490 | 0.229628 | 0.087347 | -0.135374 | -0.065922 | -0.069631 | 0.392995 | 0.261753 | 0.019217 | -0.121253 | -0.041325 | -0.077974 | 1.000000 | 0.058063 | 0.171404 | -0.049239 | 0.309410 | 0.365864 | -0.088928 | 0.066044 |
| required_car_parking_spaces | -0.217408 | -0.194801 | -0.115551 | -0.012646 | 0.000106 | 0.001714 | 0.008271 | -0.018145 | -0.024376 | -0.038135 | 0.000958 | -0.059595 | -0.129843 | 0.077778 | -0.018454 | 0.046946 | 0.131736 | 0.158899 | 0.065727 | 0.166542 | -0.030461 | -0.030066 | 0.058063 | 1.000000 | 0.082666 | 0.178677 | 0.069876 | 0.049655 | -0.094618 | -0.025303 |
| total_of_special_requests | -0.043983 | -0.235595 | -0.096560 | 0.108873 | 0.027651 | 0.025788 | 0.003050 | 0.071654 | 0.066778 | 0.022570 | -0.165658 | 0.275221 | 0.098380 | 0.013145 | -0.048587 | 0.037592 | 0.136842 | 0.124819 | 0.052423 | -0.068537 | -0.082972 | -0.135907 | 0.171404 | 0.082666 | 1.000000 | 0.226555 | 0.128116 | 0.156038 | -0.268715 | 0.077908 |
| reservation_status | -0.121841 | -0.917228 | -0.301064 | -0.017264 | -0.020520 | -0.016844 | 0.011461 | 0.009460 | -0.020913 | 0.016378 | -0.246277 | -0.059246 | -0.169370 | 0.083882 | -0.110572 | 0.053281 | 0.059591 | 0.172399 | 0.141062 | 0.124064 | -0.057764 | 0.066554 | -0.049239 | 0.178677 | 0.226555 | 1.000000 | 0.013380 | -0.053924 | -0.478602 | -0.011911 |
| is_family | -0.059504 | -0.013271 | -0.044753 | 0.052665 | 0.010292 | 0.010484 | 0.014693 | 0.051830 | 0.050410 | -0.042302 | -0.039871 | 0.079899 | -0.000858 | -0.035257 | -0.027292 | -0.022042 | 0.324479 | 0.294314 | 0.078696 | -0.004207 | -0.036442 | -0.060312 | 0.309410 | 0.069876 | 0.128116 | 0.013380 | 1.000000 | 0.580316 | -0.106983 | 0.057927 |
| total_customer | -0.045235 | 0.045046 | 0.069546 | 0.051689 | 0.026750 | 0.024749 | 0.006491 | 0.100050 | 0.100921 | -0.007011 | -0.107640 | 0.209508 | 0.139269 | -0.137663 | -0.020288 | -0.096288 | 0.384603 | 0.306438 | -0.003885 | -0.131935 | -0.026929 | -0.114158 | 0.365864 | 0.049655 | 0.156038 | -0.053924 | 0.580316 | 1.000000 | -0.082170 | 0.114517 |
| deposit_given | 0.170543 | 0.481318 | 0.379746 | -0.066119 | 0.008325 | 0.007368 | -0.008706 | -0.115016 | -0.081017 | -0.091181 | 0.334281 | -0.187179 | 0.100972 | -0.058639 | 0.143478 | -0.030459 | -0.201930 | -0.246616 | -0.119482 | -0.054546 | 0.120183 | -0.086948 | -0.088928 | -0.094618 | -0.268715 | -0.478602 | -0.106983 | -0.082170 | 1.000000 | -0.105514 |
| total_nights | -0.251797 | 0.016963 | 0.155912 | 0.032199 | 0.020844 | 0.018110 | -0.026825 | 0.760957 | 0.940370 | 0.044187 | -0.140649 | 0.125182 | 0.099766 | -0.107548 | -0.015749 | -0.051257 | 0.181397 | 0.110611 | 0.095855 | 0.007402 | -0.022973 | -0.139109 | 0.066044 | -0.025303 | 0.077908 | -0.011911 | 0.057927 | 0.114517 | -0.105514 | 1.000000 |
# ## using seaborn
# plt.figure(figsize = (24, 12))
# cmap = sns.diverging_palette(230, 20, n=256, as_cmap=True)
# sns.heatmap(corr_df,
# cmap=cmap,
# vmax=1,
# vmin = -.25,
# center=0,
# square=True,
# linewidths=.5,
# annot = True,
# fmt='.2f',
# annot_kws={'size': 10},
# cbar_kws={"shrink": .75}
# )
# plt.show()
# Draw heatmap using plotly
import plotly.figure_factory as ff
mask = np.triu(np.ones_like(corr_df, dtype=bool))
df_mask = corr_df.mask(mask)
hmap = ff.create_annotated_heatmap(z=np.around(df_mask.to_numpy(), 2),
                                   x=df_mask.columns.tolist(),
                                   y=df_mask.columns.tolist(),
                                   colorscale=px.colors.diverging.RdBu,
                                   showscale=True, ygap=1, xgap=1)
hmap.update_xaxes(side="bottom")
hmap.update_layout(
    title_text='Feature Correlation Heatmap'.upper(),
    width=1000,
    height=1000,
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    xaxis_zeroline=False,
    yaxis_zeroline=False,
    yaxis_autorange='reversed',
    template='plotly_white'
)
for i in range(len(hmap.layout.annotations)):
    hmap.layout.annotations[i].font.size = 8
    hmap.layout.annotations[i].font.color = 'black'
    if hmap.layout.annotations[i].text == 'nan':
        hmap.layout.annotations[i].text = ""
hmap.show()
# correlation between each column and cancellation
corr_df.is_canceled.sort_values()
reservation_status               -0.917228
total_of_special_requests        -0.235595
required_car_parking_spaces      -0.194801
assigned_room_type               -0.175882
booking_changes                  -0.144669
agent                            -0.127336
is_repeated_guest                -0.085185
customer_type                    -0.068698
reserved_room_type               -0.062228
previous_bookings_not_canceled   -0.055495
meal                             -0.018679
is_family                        -0.013271
arrival_date_day_of_month        -0.006173
stays_in_weekend_nights          -0.002639
arrival_date_week_number          0.007481
arrival_date_month                0.010325
arrival_date_year                 0.016339
total_nights                      0.016963
stays_in_week_nights              0.024103
total_customer                    0.045046
adr                               0.046133
days_in_waiting_list              0.054008
market_segment                    0.056972
previous_cancellations            0.109914
hotel                             0.133990
distribution_channel              0.165596
country                           0.270254
lead_time                         0.291940
deposit_given                     0.481318
is_canceled                       1.000000
Name: is_canceled, dtype: float64
The strongest correlations with is_canceled (by absolute value) are:
* reservation_status -0.917228
* deposit_given 0.481318
* total_of_special_requests -0.235595

reservation_status appears to be by far the most impactful feature. With it included, the accuracy score should be very high.
Some features show almost no correlation with cancellation:
* arrival_date_day_of_month -0.006173
* stays_in_weekend_nights -0.002639
* arrival_date_week_number 0.007481
* arrival_date_year 0.016339
* agent -0.127336

Returning to the agent column, which still has some missing values: its correlation with cancellation (-0.127336) gives it some predictive value, but since the missing values amount to about 13% of the data, it is better to drop the column.
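That 13% figure can be checked directly with `isna().mean()`; a minimal sketch on a hypothetical toy column (the real code would call this on the raw `agent` column of `df_7` before encoding):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw 'agent' column, where NaN means no agent recorded
agent = pd.Series([9.0, np.nan, 240.0, np.nan, 1.0, 14.0, np.nan, 9.0])

missing_pct = agent.isna().mean() * 100
print(f'missing agent values: {missing_pct:.1f}%')  # → 37.5%
```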
* deposit_given 0.481318
* is_family -0.013271
* total_nights 0.016963
* total_customer 0.045046

I will also drop total_nights and is_family, as they correlate only weakly with cancellation.
df_8 = df_7.drop(columns=['total_nights', 'is_family', 'arrival_date_week_number',
                          'stays_in_weekend_nights', 'arrival_date_year',
                          'arrival_date_month', 'agent'])
df_8.shape
(118902, 23)
# check nulls
l = df_8.isna().values.sum()
print('nulls in datasets = {}'.format(l))
nulls in datasets = 0
# check data types of the features
df_8.dtypes
hotel                               int64
is_canceled                         int64
lead_time                           int64
arrival_date_day_of_month           int64
stays_in_week_nights                int64
meal                                int32
country                             int32
market_segment                      int32
distribution_channel                int32
is_repeated_guest                   int64
previous_cancellations              int64
previous_bookings_not_canceled      int64
reserved_room_type                  int32
assigned_room_type                  int32
booking_changes                     int64
days_in_waiting_list                int64
customer_type                       int32
adr                               float64
required_car_parking_spaces         int64
total_of_special_requests           int64
reservation_status                  int32
total_customer                    float64
deposit_given                       int64
dtype: object
df_8.to_csv('HotelCancelation.csv', index = False)
data = pd.read_csv('HotelCancelation.csv')
print(data.shape)
data.head()
(118902, 23)
| | hotel | is_canceled | lead_time | arrival_date_day_of_month | stays_in_week_nights | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | total_customer | deposit_given |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 342 | 1 | 0 | 0 | 135 | 3 | 1 | 0 | 0 | 0 | 2 | 2 | 3 | 0 | 2 | 0.0 | 0 | 0 | 1 | 2.0 | 0 |
| 1 | 0 | 0 | 737 | 1 | 0 | 0 | 135 | 3 | 1 | 0 | 0 | 0 | 2 | 2 | 4 | 0 | 2 | 0.0 | 0 | 0 | 1 | 2.0 | 0 |
| 2 | 0 | 0 | 7 | 1 | 1 | 0 | 59 | 3 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 75.0 | 0 | 0 | 1 | 1.0 | 0 |
| 3 | 0 | 0 | 13 | 1 | 1 | 0 | 59 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 75.0 | 0 | 0 | 1 | 1.0 | 0 |
| 4 | 0 | 0 | 14 | 1 | 2 | 0 | 59 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 98.0 | 0 | 1 | 1 | 2.0 | 0 |
data.var()
hotel                                 0.222117
is_canceled                           0.233457
lead_time                         11428.278631
arrival_date_day_of_month            77.094910
stays_in_week_nights                  3.610625
meal                                  1.143925
country                            1995.974199
market_segment                        1.591036
distribution_channel                  0.812963
is_repeated_guest                     0.030985
previous_cancellations                0.715470
previous_bookings_not_canceled        2.204177
reserved_room_type                    2.876830
assigned_room_type                    3.517656
booking_changes                       0.426116
days_in_waiting_list                310.822571
customer_type                         0.333933
adr                                2548.937594
required_car_parking_spaces           0.059618
total_of_special_requests             0.628339
reservation_status                    0.248077
total_customer                        0.521122
deposit_given                         0.107542
dtype: float64
# normalizing numerical variables
nor_data = data.copy()
nor_data['lead_time'] = np.log(data['lead_time'] + 1)
nor_data['arrival_date_day_of_month'] = np.log(data['arrival_date_day_of_month'] + 1)
nor_data['days_in_waiting_list'] = np.log(data['days_in_waiting_list'] + 1)
nor_data['country'] = np.log(data['country'] + 1)
nor_data['adr'] = np.log(data['adr'] + 1)
nor_data.var()
hotel                             0.222117
is_canceled                       0.233457
lead_time                         2.573840
arrival_date_day_of_month         0.506135
stays_in_week_nights              3.610625
meal                              1.143925
country                           0.412414
market_segment                    1.591036
distribution_channel              0.812963
is_repeated_guest                 0.030985
previous_cancellations            0.715470
previous_bookings_not_canceled    2.204177
reserved_room_type                2.876830
assigned_room_type                3.517656
booking_changes                   0.426116
days_in_waiting_list              0.505972
customer_type                     0.333933
adr                               0.536856
required_car_parking_spaces       0.059618
total_of_special_requests         0.628339
reservation_status                0.248077
total_customer                    0.521122
deposit_given                     0.107542
dtype: float64
# check nulls
l = nor_data.isna().values.sum()
print('nulls in datasets = {}'.format(l))
nulls in datasets = 1
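The single NaN most plausibly comes from a negative `adr` value: `np.log(x + 1)` is undefined for x < -1, and the raw data contains one negative rate (an adr of -6.38 in the published dataset). A small sketch of the failure mode:

```python
import numpy as np

# log(x + 1) is NaN for x < -1; this is how one transformed adr row became null
vals = np.array([-6.38, 0.0, 75.0])
with np.errstate(invalid='ignore'):
    logged = np.log(vals + 1)

print(np.isnan(logged))        # → [ True False False]
print(np.isnan(logged).sum())  # → 1
```

Dropping the row (as done below) or clipping negative rates to zero before the transform both avoid the NaN.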
nor_data = nor_data.dropna()
l = nor_data.isna().values.sum()
print('nulls in datasets = {}'.format(l))
nulls in datasets = 0
print(nor_data.shape)
nor_data.head()
(118901, 23)
| | hotel | is_canceled | lead_time | arrival_date_day_of_month | stays_in_week_nights | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | total_customer | deposit_given |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 5.837730 | 0.693147 | 0 | 0 | 4.912655 | 3 | 1 | 0 | 0 | 0 | 2 | 2 | 3 | 0.0 | 2 | 0.000000 | 0 | 0 | 1 | 2.0 | 0 |
| 1 | 0 | 0 | 6.603944 | 0.693147 | 0 | 0 | 4.912655 | 3 | 1 | 0 | 0 | 0 | 2 | 2 | 4 | 0.0 | 2 | 0.000000 | 0 | 0 | 1 | 2.0 | 0 |
| 2 | 0 | 0 | 2.079442 | 0.693147 | 1 | 0 | 4.094345 | 3 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0.0 | 2 | 4.330733 | 0 | 0 | 1 | 1.0 | 0 |
| 3 | 0 | 0 | 2.639057 | 0.693147 | 1 | 0 | 4.094345 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 2 | 4.330733 | 0 | 0 | 1 | 1.0 | 0 |
| 4 | 0 | 0 | 2.708050 | 0.693147 | 2 | 0 | 4.094345 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 2 | 4.595120 | 0 | 1 | 1 | 2.0 | 0 |
The reservation_status feature has a very high correlation with the cancellation outcome, which pushes a model that uses it to almost 99% accuracy.
So I also build the models without this column and compare the different models.
X_resv = nor_data.drop(['is_canceled'], axis = 1)
y = nor_data['is_canceled']
print('X_resv shape: {}\ny shape: {}'.format(X_resv.shape, y.shape))
X_resv shape: (118901, 22) y shape: (118901,)
test_size = 0.3
myseed = 42
X_resv_train, X_resv_test, y_train, y_test = train_test_split(X_resv, y, test_size = test_size, random_state = myseed)
print('X train shape: {}\ny train shape: {}\n'.format(X_resv_train.shape, y_train.shape))
print('X test shape: {}\ny test shape: {}'.format(X_resv_test.shape, y_test.shape))
X train shape: (83230, 22) y train shape: (83230,) X test shape: (35671, 22) y test shape: (35671,)
X = nor_data.drop(['is_canceled','reservation_status'], axis = 1)
y = nor_data['is_canceled']
print('X shape: {}\ny shape: {}'.format(X.shape, y.shape))
X shape: (118901, 21) y shape: (118901,)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, random_state = myseed)
print('X train shape: {}\ny train shape: {}\n'.format(X_train.shape, y_train.shape))
print('X test shape: {}\ny test shape: {}'.format(X_test.shape, y_test.shape))
X train shape: (83230, 21) y train shape: (83230,) X test shape: (35671, 21) y test shape: (35671,)
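One caveat on the splits: is_canceled is imbalanced (roughly 63% kept vs 37% canceled), and a plain random split can skew the class ratio between train and test. Passing `stratify=y` to `train_test_split` preserves the ratio in both splits; a minimal sketch on toy labels (not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels standing in for is_canceled (70% class 0, 30% class 1)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Both splits keep exactly 30% positives
print(y_tr.mean(), y_te.mean())  # → 0.3 0.3
```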
# Building
lg1=LogisticRegression()
# training
lg1.fit(X_resv_train, y_train)
LogisticRegression()
# testing
lg1_y_pred = lg1.predict(X_resv_test)
print('y pred: ',lg1_y_pred[:30])
print('y actual: ',y_test[:30].values)
y pred: [0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 0 0 0 0 1 1 0 1 1 1 1 0 0 1] y actual: [0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1]
# Evaluation
lg1_acc = accuracy_score(y_test, lg1_y_pred)*100
lg1_conf = confusion_matrix(y_test, lg1_y_pred)
lg1_report = classification_report(y_test, lg1_y_pred)
print('#### The {} ####\n'.format('Logistic Regression (with reservation_status feature)'))
print("Accuracy Score of Logistic Regression is: \n{:2.2f}%\n".format(lg1_acc))
print("Confusion Matrix of Logistic Regression is:\n{}\n".format(lg1_conf))
print("Classification Report of Logistic Regression is:\n{}\n".format(lg1_report))
print('################################# End #################################\n')
#### The Logistic Regression (with reservation_status feature) ####
Accuracy Score of Logistic Regression is:
98.93%
Confusion Matrix of Logistic Regression is:
[[22331 7]
[ 375 12958]]
Classification Report of Logistic Regression is:
precision recall f1-score support
0 0.98 1.00 0.99 22338
1 1.00 0.97 0.99 13333
accuracy 0.99 35671
macro avg 0.99 0.99 0.99 35671
weighted avg 0.99 0.99 0.99 35671
################################# End #################################
# Building
lg2=LogisticRegression()
# training
lg2.fit(X_train, y_train)
LogisticRegression()
# testing
lg2_y_pred = lg2.predict(X_test)
print('y pred: ',lg2_y_pred[:30])
print('y actual: ',y_test[:30].values)
y pred: [0 0 1 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1] y actual: [0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1]
# Evaluation
lg2_acc = accuracy_score(y_test, lg2_y_pred)*100
lg2_conf = confusion_matrix(y_test, lg2_y_pred)
lg2_report = classification_report(y_test, lg2_y_pred)
print('#### The {} ####\n'.format('Logistic Regression (without reservation_status feature)'))
print("Accuracy Score of Logistic Regression is: \n{:2.2f}%\n".format(lg2_acc))
print("Confusion Matrix of Logistic Regression is:\n{}\n".format(lg2_conf))
print("Classification Report of Logistic Regression is:\n{}\n".format(lg2_report))
print('################################# End #################################\n')
#### The Logistic Regression (without reservation_status feature) ####
Accuracy Score of Logistic Regression is:
80.12%
Confusion Matrix of Logistic Regression is:
[[20888 1450]
[ 5641 7692]]
Classification Report of Logistic Regression is:
precision recall f1-score support
0 0.79 0.94 0.85 22338
1 0.84 0.58 0.68 13333
accuracy 0.80 35671
macro avg 0.81 0.76 0.77 35671
weighted avg 0.81 0.80 0.79 35671
################################# End #################################
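The 80.12% score and the class-1 precision/recall in the report can be re-derived by hand from the confusion matrix above:

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted class
conf = np.array([[20888, 1450],
                 [5641, 7692]])

tn, fp = conf[0]
fn, tp = conf[1]

accuracy = (tp + tn) / conf.sum()
precision_1 = tp / (tp + fp)  # of bookings predicted canceled, fraction truly canceled
recall_1 = tp / (tp + fn)     # of actual cancellations, fraction caught

print(round(accuracy, 4), round(precision_1, 4), round(recall_1, 4))
# → 0.8012 0.8414 0.5769
```

The low recall for class 1 (0.58) shows the model misses many actual cancellations even though overall accuracy looks respectable.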
# hyper-parameter
depth = 15
# Building
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth = depth)
# training
dt.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=15)
# testing
dt_y_pred = dt.predict(X_test)
print('y pred: ',dt_y_pred[:30])
print('y actual: ',y_test[:30].values)
y pred: [0 0 1 0 0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 1 1 0 1 1 0 1 0 0 1] y actual: [0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1]
# Evaluation
dt_acc = accuracy_score(y_test, dt_y_pred)*100
dt_conf = confusion_matrix(y_test, dt_y_pred)
dt_report = classification_report(y_test, dt_y_pred)
print('####################### The {} Classifier ###################\n'.format('Decision Tree'))
print("Accuracy Score of Logistic Regression is: \n{:2.2f}%\n".format(dt_acc))
print("Confusion Matrix of Logistic Regression is:\n{}\n".format(dt_conf))
print("Classification Report of Logistic Regression is:\n{}\n".format(dt_report))
print('################################# End #################################\n')
####################### The Decision Tree Classifier ###################
Accuracy Score of the Decision Tree is: 
83.89%
Confusion Matrix of the Decision Tree is:
[[19804 2534]
[ 3214 10119]]
Classification Report of the Decision Tree is:
precision recall f1-score support
0 0.86 0.89 0.87 22338
1 0.80 0.76 0.78 13333
accuracy 0.84 35671
macro avg 0.83 0.82 0.83 35671
weighted avg 0.84 0.84 0.84 35671
################################# End #################################
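The fitted tree also exposes which features drive its splits via `feature_importances_`. A minimal sketch on synthetic data (the real code would pass the `dt` model above together with `X_train.columns`):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the booking features (5 hypothetical columns)
X_demo, y_demo = make_classification(n_samples=500, n_features=5,
                                     n_informative=3, random_state=42)
cols = [f'feat_{i}' for i in range(5)]

tree = DecisionTreeClassifier(max_depth=15, random_state=42).fit(X_demo, y_demo)

# Importances sum to 1; sorting shows the most influential features first
importances = (pd.Series(tree.feature_importances_, index=cols)
               .sort_values(ascending=False))
print(importances)
```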
# hyper-parameters
boost = 'gbtree'
lr = 0.1
n_estim = 500
depth = 20
# Building
from xgboost import XGBClassifier
xgb = XGBClassifier(booster = boost, learning_rate = lr, max_depth = depth, n_estimators = n_estim)
# training
xgb.fit(X_train, y_train)
[04:33:59] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.1, max_delta_step=0, max_depth=20,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=500, n_jobs=8, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
# testing
xgb_y_pred = xgb.predict(X_test)
print('y pred: ',xgb_y_pred[:30])
print('y actual: ',y_test[:30].values)
y pred: [0 0 1 0 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1] y actual: [0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1]
# Evaluation
xgb_acc = accuracy_score(y_test, xgb_y_pred)*100
xgb_conf = confusion_matrix(y_test, xgb_y_pred)
xgb_report = classification_report(y_test, xgb_y_pred)
print('####################### The {} Classifier #####################\n'.format('XgBoost'))
print("Accuracy Score of Logistic Regression is: \n{:2.2f}%\n".format(xgb_acc))
print("Confusion Matrix of Logistic Regression is:\n{}\n".format(xgb_conf))
print("Classification Report of Logistic Regression is:\n{}\n".format(xgb_report))
print('################################# End #################################\n')
####################### The XgBoost Classifier #####################
Accuracy Score of XgBoost is: 
87.74%
Confusion Matrix of XgBoost is:
[[20573 1765]
[ 2608 10725]]
Classification Report of XgBoost is:
precision recall f1-score support
0 0.89 0.92 0.90 22338
1 0.86 0.80 0.83 13333
accuracy 0.88 35671
macro avg 0.87 0.86 0.87 35671
weighted avg 0.88 0.88 0.88 35671
################################# End #################################
# hyper-parameters
rf_hparams = {"max_depth": [16,18,20],
"n_estimators": [100,500],
"min_samples_split": [2,5]}
cv = 5
# Building
from sklearn.ensemble import RandomForestClassifier
rf = GridSearchCV(RandomForestClassifier(), rf_hparams, cv = cv, n_jobs = -1, verbose = 2)
# training RF using CV
rf.fit(X_train, y_train)
Fitting 5 folds for each of 12 candidates, totalling 60 fits
GridSearchCV(cv=5, estimator=RandomForestClassifier(), n_jobs=-1,
param_grid={'max_depth': [16, 18, 20], 'min_samples_split': [2, 5],
'n_estimators': [100, 500]},
verbose=2)
rf.best_params_
{'max_depth': 20, 'min_samples_split': 2, 'n_estimators': 500}
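Rebuilding the model from `best_params_` works, but with the default `refit=True` GridSearchCV has already retrained the winning configuration on the full training set and exposes it as `best_estimator_`. A small sketch on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=200, random_state=42)

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      {'max_depth': [3, 5]}, cv=3)
search.fit(X_demo, y_demo)

# refit=True (the default) means this estimator is already trained on all
# of X_demo with the winning parameter combination
best_model = search.best_estimator_
print(best_model.get_params()['max_depth'] == search.best_params_['max_depth'])  # → True
```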
# build rf using best parameters
best_rf = RandomForestClassifier(max_depth = rf.best_params_['max_depth'],
                                 min_samples_split = rf.best_params_['min_samples_split'],
                                 n_estimators = rf.best_params_['n_estimators'])
# training best RF model
best_rf.fit(X_train, y_train)
RandomForestClassifier(max_depth=20, n_estimators=500)
# testing
best_rf_y_pred = best_rf.predict(X_test)
print('y pred: ',best_rf_y_pred[:30])
print('y actual: ',y_test[:30].values)
y pred: [0 0 1 0 0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 1 1 0 1 1 0 1 0 0 1] y actual: [0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1]
# Evaluation
best_rf_acc = accuracy_score(y_test, best_rf_y_pred)*100
best_rf_conf = confusion_matrix(y_test, best_rf_y_pred)
best_rf_report = classification_report(y_test, best_rf_y_pred)
print('####################### The {} Classifier #######################\n'.format('Random Forest'))
print("Accuracy Score of Logistic Regression is: \n{:2.2f}%\n".format(best_rf_acc))
print("Confusion Matrix of Logistic Regression is:\n{}\n".format(best_rf_conf))
print("Classification Report of Logistic Regression is:\n{}\n".format(best_rf_report))
print('################################# End #################################\n')
####################### The Random Forest Classifier #######################
Accuracy Score of the Random Forest is: 
86.98%
Confusion Matrix of the Random Forest is:
[[20876 1462]
[ 3183 10150]]
Classification Report of the Random Forest is:
precision recall f1-score support
0 0.87 0.93 0.90 22338
1 0.87 0.76 0.81 13333
accuracy 0.87 35671
macro avg 0.87 0.85 0.86 35671
weighted avg 0.87 0.87 0.87 35671
################################# End #################################
models = ['LR (with reservation)',
'LR (without reservation)',
'DT Classifier','XgBoost Classifier',
'RF Classifier using CV']
model_accs = [lg1_acc,lg2_acc,dt_acc,xgb_acc,best_rf_acc]
conclusion_df = pd.DataFrame({'Model Name':models,'Accuracy (%)':np.around(model_accs,2)})
conclusion_df
| | Model Name | Accuracy (%) |
|---|---|---|
| 0 | LR (with reservation) | 98.93 |
| 1 | LR (without reservation) | 80.12 |
| 2 | DT Classifier | 83.89 |
| 3 | XgBoost Classifier | 87.74 |
| 4 | RF Classifier using CV | 86.98 |
px.bar(conclusion_df, y='Accuracy (%)', x='Model Name', color='Accuracy (%)',
       title = 'Models comparison'.upper())